Forecasting ensembles and combinations

Rob J Hyndman

15 November 2022

Introduction

Where is Melbourne?

Australian monthly café turnover

Mean (or point) forecast

Interval forecast

Quantile forecasts

Tidyverts packages

The tsibble class

library(tidyverse)
library(tsibble)
library(fable)
auscafe
# A tsibble: 168 x 2 [1M]
      Month Turnover
      <mth>    <dbl>
 1 2006 Jan    1914.
 2 2006 Feb    1750 
 3 2006 Mar    1984.
 4 2006 Apr    1967.
 5 2006 May    2005.
 6 2006 Jun    1944.
 7 2006 Jul    2019.
 8 2006 Aug    2043.
 9 2006 Sep    2039.
10 2006 Oct    2113.
# … with 158 more rows

Using the fable package

auscafe |>
  filter(year(Month) <= 2017) |>
  model(
    ets = ETS(Turnover),
    arima = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    snaive = SNAIVE(Turnover)
  )
# A mable: 1 x 3
           ets                     arima   snaive
       <model>                   <model>  <model>
1 <ETS(M,A,M)> <ARIMA(0,1,1)(0,1,1)[12]> <SNAIVE>

Using the fable package

auscafe |>
  filter(year(Month) <= 2017) |>
  model(
    ets = ETS(Turnover),
    arima = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    snaive = SNAIVE(Turnover)
  ) |>
  mutate(ensemble = (ets + arima + snaive)/3)
# A mable: 1 x 4
           ets                     arima   snaive      ensemble
       <model>                   <model>  <model>       <model>
1 <ETS(M,A,M)> <ARIMA(0,1,1)(0,1,1)[12]> <SNAIVE> <COMBINATION>

Using the fable package

auscafe |>
  filter(year(Month) <= 2017) |>
  model(
    ets = ETS(Turnover),
    arima = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    snaive = SNAIVE(Turnover)
  ) |>
  mutate(ensemble = (ets + arima + snaive)/3) |>
  forecast(h = "2 years")
# A fable: 96 x 4 [1M]
# Key:     .model [4]
   .model    Month       Turnover .mean
   <chr>     <mth>         <dist> <dbl>
 1 ets    2018 Jan  N(3753, 4335) 3753.
 2 ets    2018 Feb  N(3434, 4937) 3434.
 3 ets    2018 Mar  N(3801, 7735) 3801.
 4 ets    2018 Apr  N(3736, 9189) 3736.
 5 ets    2018 May N(3782, 11269) 3782.
 6 ets    2018 Jun N(3666, 12416) 3666.
 7 ets    2018 Jul N(3876, 16022) 3876.
 8 ets    2018 Aug N(3923, 18713) 3923.
 9 ets    2018 Sep N(3878, 20630) 3878.
10 ets    2018 Oct N(4010, 24681) 4010.
# … with 86 more rows

Using the fable package

auscafe |>
  filter(year(Month) <= 2017) |>
  model(
    ets = ETS(Turnover),
    arima = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    snaive = SNAIVE(Turnover)
  ) |>
  mutate(ensemble = (ets + arima + snaive)/3) |>
  forecast(h = "2 years") |>
  accuracy(
    data = auscafe,
    measures = list(rmse=RMSE)
  ) |>
  arrange(rmse)
# A tibble: 4 × 3
  .model   .type  rmse
  <chr>    <chr> <dbl>
1 ensemble Test   28.7
2 ets      Test   88.2
3 arima    Test  111. 
4 snaive   Test  174. 

Forecasting competitions

Galton (1907)

According to the democratic principle of “one vote one value”, the middlemost estimate expresses the vox populi.

Newbold and Granger (1974)

“The combined forecasting methods seem to me to be non-starters … Is a combined method not in danger of falling between two stools?” (Maurice Priestley)

Makridakis competition (1982)

Spyros Makridakis

The first genuine forecasting competition

  • 1001 series from demography, industry, economics.
  • Highly controversial at the time.
  • Showed combinations out-performed individual methods.


Conclusions

  1. Sophisticated methods do not necessarily provide more accurate forecasts than simpler ones.
  2. The relative ranking of the various methods varies according to the accuracy measure being used.
  3. The accuracy of combinations is usually better than individual methods.
  4. The accuracy of methods depends on the length of the forecasting horizon involved.


Consequences

Researchers started to:

  • take combination methods seriously;
  • consider how to automate forecasting methods;
  • study what methods give the best forecasts;
  • be aware of the dangers of over-fitting;
  • treat forecasting as a different problem from time series analysis.

M4 competition (2020)

  • January – May 2018
  • 100,000 time series: yearly, quarterly, monthly, weekly, daily, hourly.
  • Point forecasts and prediction intervals assessed.
  • Code must be public.
  • 248 registrations, 50 submissions.


Winning methods

  1. Hybrid of Recurrent Neural Network and Exponential Smoothing models
  2. Forecast combination using xgboost to find weights

Combining point forecasts

Weighted averages

Suppose we have N different forecasting methods.

  • \hat{\bm{y}}_{T+h|T} = N-vector of h-step-ahead forecasts using information up to time T.

  • \bm{\Sigma}_{T+h|T}= N\times N covariance matrix of the h-step forecast errors.

  • \bm{1}= unit vector.

  • Simple combination: \tilde{y}_{T+h|T} = N^{-1}\bm{1}'\hat{\bm{y}}_{T+h|T}

  • Linear combination: \tilde{y}_{T+h|T} = \bm{w}'_{T+h|T}\hat{\bm{y}}_{T+h|T}

Optimal combination (minimizing MSE):

\bm{w}_{T+h|T} = \frac{\bm{\Sigma}_{T+h|T}^{-1}\bm{1}}{\bm{1}'\bm{\Sigma}_{T+h|T}^{-1}\bm{1}} (Bates & Granger, 1969; Newbold & Granger, 1974)
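As a numerical sketch of this formula (in Python rather than R, and with an invented 3×3 error covariance matrix), the optimal weights come straight from solving one linear system:

```python
import numpy as np

# Hypothetical covariance matrix of h-step forecast errors for three
# methods (invented numbers, for illustration only).
Sigma = np.array([
    [4.0, 1.0, 0.5],
    [1.0, 9.0, 1.5],
    [0.5, 1.5, 16.0],
])
ones = np.ones(3)

# Optimal combination weights: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1).
Sigma_inv_1 = np.linalg.solve(Sigma, ones)
w = Sigma_inv_1 / (ones @ Sigma_inv_1)

print(w)  # weights sum to 1; lower-variance methods get more weight
```

Note that the weights depend on the full covariance matrix, not just the variances: strongly correlated methods share (and can even receive negative) weight.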

Bates and Granger (1969)

Variations

Regression weights

Regress y_{t+h} against \hat{\bm{y}}_{t+h|t} for t=1,\dots,T.

  • Gives identical results to the minimum-MSE approach if the weights are constrained to sum to 1 and there is no intercept.
  • Forecasts are highly correlated, so principal component regression works better.

Inverse variance weights

  • More weight on methods with smaller forecast variance

AIC weights

  • More weight on methods with lower AIC values
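A minimal sketch of both weighting schemes (Python, with invented variances and AIC values; the AIC weights use the standard Akaike-weight construction w_i ∝ exp(−ΔAIC_i/2), which the slide does not spell out):

```python
import numpy as np

# Inverse-variance weights: w_i proportional to 1 / sigma_i^2.
sigma2 = np.array([4.0, 9.0, 16.0])       # invented forecast variances
w_iv = (1 / sigma2) / (1 / sigma2).sum()
print(w_iv)

# AIC weights: w_i proportional to exp(-Delta_i / 2),
# where Delta_i = AIC_i - min(AIC).
aic = np.array([102.3, 100.0, 107.8])     # invented AIC values
delta = aic - aic.min()
w_aic = np.exp(-delta / 2) / np.exp(-delta / 2).sum()
print(w_aic)
```

Both schemes ignore correlations between methods, which is what makes them cheap and robust compared with the full optimal-weights solution.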

Yet more variations

  • Bayesian weights
  • Nonlinear combinations
  • Sparse combinations
  • Meta-learning weights

Forecast combination puzzle

Simple average combinations almost always give more accurate forecasts than other combination methods.

  • Due to the error arising from estimating the weights
  • Sample sizes are rarely large enough to do better than simple averaging
  • Optimal weights change over time, making estimation even more difficult
  • Successful combination strategies have used huge collections of time series

FFORMA

Feature-based FORecast Model Averaging

  1. Take 100,000 series from M4 competition.
  2. For each series produce forecasts from 8 different models.
  3. For each series compute a large number of features (e.g., length, strength of seasonality, strength of trend, spectral entropy, autocorrelations, Hurst exponent, unit root test statistics, etc.).
  4. Train xgboost to learn weights of forecast combination using features as inputs.
  • The probability of each model being best is used to construct a model weight.
  • A combination forecast is produced using these weights.
  • Came second in the M4 competition.
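A toy sketch of the final two steps (Python, with invented probabilities and forecasts; in the real FFORMA pipeline the probabilities come from the trained xgboost meta-learner):

```python
import numpy as np

# Suppose the meta-learner outputs, for one series, the probability
# that each of three models is best (invented numbers).
p_best = np.array([0.5, 0.3, 0.2])

# Point forecasts from the three models over a 3-step horizon (invented).
forecasts = np.array([
    [100.0, 102.0, 104.0],   # model 1
    [ 98.0, 101.0, 103.0],   # model 2
    [105.0, 105.0, 105.0],   # model 3
])

# Combination forecast: weight each model's forecasts by its
# probability of being best.
combined = p_best @ forecasts
print(combined)
```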

Combining probabilistic forecasts

Ensemble forecasting

Ensemble forecasting: mix the forecast distributions from multiple models.

Forecasting ensemble (linear pooling)

  • Forecasts obtained from a weighted mixture distribution of the component forecasts.
  • Works best when individual models are over-confident and use different data sources.

Let F_{i,T+h|T}(y) be the h-step forecast distribution for the ith method. Then
\begin{align*}
\tilde{F}_{T+h|T}(y) &= \sum_{i=1}^N w_i F_{i,T+h|T}(y) \\
\tilde{\mu} &= \sum_{i=1}^N w_i \mu_i \\
\tilde{\sigma}^2 &= \sum_{i=1}^N w_i \sigma_i^2 + \sum_{i=1}^N w_i (\mu_i - \tilde{\mu})^2
\end{align*}

More diverse forecasts may inflate the variance, but this guards against over-confidence.
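The pooled moments are easy to check numerically; a minimal sketch in Python with invented component weights, means, and variances:

```python
import numpy as np

# A linear pool of three Gaussian forecast distributions
# (all parameter values invented for illustration).
w  = np.array([0.4, 0.4, 0.2])      # mixture weights (sum to 1)
mu = np.array([100.0, 110.0, 95.0]) # component means
s2 = np.array([25.0, 16.0, 36.0])   # component variances

# Moments of the mixture distribution.
mu_tilde = (w * mu).sum()
s2_tilde = (w * s2).sum() + (w * (mu - mu_tilde) ** 2).sum()
print(mu_tilde, s2_tilde)
```

The second term in the variance is the disagreement between the component means: the pooled variance exceeds the average component variance whenever the models disagree, which is exactly the anti-over-confidence effect described above.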

Evaluating probabilistic forecasts

\begin{align*}
f_{p,t} &= \text{quantile forecast with prob. $p$ at time $t$}\\
y_{t} &= \text{observation at time $t$}
\end{align*}

Quantile score

Q_{p,t} = \begin{cases}
2(1 - p) \big|y_t - f_{p,t}\big|, & \text{if $y_{t} < f_{p,t}$}\\
2p \big|y_{t} - f_{p,t}\big|, & \text{if $y_{t} \ge f_{p,t}$}
\end{cases}

  • Low Q_{p,t} is good
  • Multiplier of 2 often omitted, but useful for interpretation
  • Q_{p,t} like absolute error (weighted to account for likely exceedance)
  • Average Q_{p,t} over p = CRPS (Continuous Ranked Probability Score)
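A minimal numerical sketch (Python, assuming a hypothetical N(100, 10²) forecast distribution and a single invented observation): averaging the quantile score over a grid of probabilities approximates the CRPS.

```python
import numpy as np
from statistics import NormalDist

def quantile_score(y, f, p):
    """Q_{p,t}: the pinball loss, with the multiplier of 2."""
    return 2 * (1 - p) * (f - y) if y < f else 2 * p * (y - f)

# Hypothetical forecast distribution and observation.
dist = NormalDist(mu=100, sigma=10)
y = 108.0

# Quantile forecasts f_{p,t} on a grid of probabilities; the mean
# quantile score over p approximates the CRPS.
ps = np.linspace(0.005, 0.995, 199)
crps = np.mean([quantile_score(y, dist.inv_cdf(p), p) for p in ps])
print(round(crps, 2))
```

For a Gaussian forecast the CRPS has a closed form, so the grid average here is only an approximation; a finer grid tightens it.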

Ensemble forecasting

# A tibble: 4 × 4
  .model   .type  rmse  crps
  <chr>    <chr> <dbl> <dbl>
1 arima    Test   79.1  48.5
2 ensemble Test   29.7  39.6
3 ets      Test   61.4  41.3
4 snaive   Test  116.   70.6

Forecasting COVID-19 cases

Australian Health Protection Principal Committee

Data sources

  • Case-level data of all positive COVID-19 tests: onset and detection times.
  • Daily population mobility data from Google, Apple & Facebook
  • Weekly non-household contact surveys
  • Weekly behavioural surveys
  • Daily case numbers from many countries and regions via the Johns Hopkins COVID-19 repository

Case numbers

  • Recent case numbers are uncertain and incomplete, as the date of onset is not known until symptoms appear and a test is obtained.

A model ensemble

Model 1: SEEIIR (Uni Melbourne/Doherty Institute)

  • Stochastic compartmental model with time-varying effective reproduction number.

Model 2: Generative model (Uni Adelaide)

  • Simulation with three types of infectious individuals: imported, asymptomatic, symptomatic

Model 3: Global AR model (Monash)

  • Single model fitted to all Johns Hopkins data from countries and regions with sufficient data.
  • Series with obvious anomalies removed.

Ensemble forecasts: Victoria

Forecastability factors

Something is easier to forecast if:

  1. we have a good understanding of the factors that contribute to it, and can measure them;
  2. there is lots of data available;
  3. the future is somewhat similar to the past;
  4. the forecasts cannot affect the thing we are trying to forecast.

CRPS: Continuous Ranked Probability Score

Forecasting many series

cafe
# A tsibble: 1,344 x 3 [1M]
# Key:       State [8]
      Month State Turnover
      <mth> <chr>    <dbl>
 1 2006 Jan ACT       21.7
 2 2006 Feb ACT       21.9
 3 2006 Mar ACT       24.9
 4 2006 Apr ACT       24.8
 5 2006 May ACT       27  
 6 2006 Jun ACT       24.5
 7 2006 Jul ACT       24  
 8 2006 Aug ACT       26.1
 9 2006 Sep ACT       26.2
10 2006 Oct ACT       33.7
# … with 1,334 more rows

Forecasting many series

cafe |>
  filter(year(Month) <= 2018)
# A tsibble: 1,248 x 3 [1M]
# Key:       State [8]
      Month State Turnover
      <mth> <chr>    <dbl>
 1 2006 Jan ACT       21.7
 2 2006 Feb ACT       21.9
 3 2006 Mar ACT       24.9
 4 2006 Apr ACT       24.8
 5 2006 May ACT       27  
 6 2006 Jun ACT       24.5
 7 2006 Jul ACT       24  
 8 2006 Aug ACT       26.1
 9 2006 Sep ACT       26.2
10 2006 Oct ACT       33.7
# … with 1,238 more rows

Forecasting many series

cafe |>
  filter(year(Month) <= 2018) |>
  model(
    ETS = ETS(Turnover),
    ARIMA = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    SNAIVE = SNAIVE(Turnover)
  ) |>
  mutate(
    COMB = (ETS+ARIMA)/2
  )
# A mable: 8 x 5
# Key:     State [8]
  State           ETS                     ARIMA   SNAIVE          COMB
  <chr>       <model>                   <model>  <model>       <model>
1 ACT   <ETS(M,Ad,M)> <ARIMA(2,1,2)(0,1,1)[12]> <SNAIVE> <COMBINATION>
2 NSW   <ETS(M,Ad,M)> <ARIMA(0,1,1)(0,1,2)[12]> <SNAIVE> <COMBINATION>
3 NT     <ETS(A,N,A)> <ARIMA(1,1,0)(1,1,0)[12]> <SNAIVE> <COMBINATION>
4 QLD   <ETS(M,Ad,M)> <ARIMA(0,1,1)(1,1,1)[12]> <SNAIVE> <COMBINATION>
5 SA    <ETS(M,Ad,M)> <ARIMA(0,1,1)(0,1,1)[12]> <SNAIVE> <COMBINATION>
6 TAS    <ETS(A,N,A)> <ARIMA(0,1,1)(0,1,1)[12]> <SNAIVE> <COMBINATION>
7 VIC   <ETS(M,Ad,M)> <ARIMA(0,1,2)(0,1,2)[12]> <SNAIVE> <COMBINATION>
8 WA    <ETS(M,Ad,M)> <ARIMA(0,1,1)(0,1,2)[12]> <SNAIVE> <COMBINATION>

Forecasting many series

cafe |>
  filter(year(Month) <= 2018) |>
  model(
    ETS = ETS(Turnover),
    ARIMA = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    SNAIVE = SNAIVE(Turnover)
  ) |>
  mutate(
    COMB = (ETS+ARIMA)/2
  ) |>
  forecast(h = "1 year")
# A fable: 384 x 5 [1M]
# Key:     State, .model [32]
   State .model    Month   Turnover .mean
   <chr> <chr>     <mth>     <dist> <dbl>
 1 ACT   ETS    2019 Jan N(61, 9.6)  60.7
 2 ACT   ETS    2019 Feb  N(64, 15)  63.8
 3 ACT   ETS    2019 Mar  N(72, 26)  71.9
 4 ACT   ETS    2019 Apr  N(67, 28)  67.0
 5 ACT   ETS    2019 May  N(70, 36)  69.8
 6 ACT   ETS    2019 Jun  N(67, 39)  67.3
 7 ACT   ETS    2019 Jul  N(68, 46)  68.4
 8 ACT   ETS    2019 Aug  N(70, 53)  69.7
 9 ACT   ETS    2019 Sep  N(69, 57)  68.5
10 ACT   ETS    2019 Oct  N(70, 66)  70.2
# … with 374 more rows

Forecasting many series

cafe |>
  filter(year(Month) <= 2018) |>
  model(
    ETS = ETS(Turnover),
    ARIMA = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    SNAIVE = SNAIVE(Turnover)
  ) |>
  mutate(
    COMB = (ETS+ARIMA)/2
  ) |>
  forecast(h = "1 year") |>
  accuracy(data = cafe,
    measures = list(crps=CRPS, rmse=RMSE)
  )
# A tibble: 32 × 5
   .model State .type  crps  rmse
   <chr>  <chr> <chr> <dbl> <dbl>
 1 ARIMA  ACT   Test   1.64  2.23
 2 ARIMA  NSW   Test  18.4  28.4 
 3 ARIMA  NT    Test   2.19  3.89
 4 ARIMA  QLD   Test  15.0  24.9 
 5 ARIMA  SA    Test   4.06  6.70
 6 ARIMA  TAS   Test   1.52  2.70
 7 ARIMA  VIC   Test  30.4  48.6 
 8 ARIMA  WA    Test   9.06 14.8 
 9 COMB   ACT   Test   2.02  3.31
10 COMB   NSW   Test  17.8  14.8 
# … with 22 more rows

Forecasting many series

cafe |>
  filter(year(Month) <= 2018) |>
  model(
    ETS = ETS(Turnover),
    ARIMA = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    SNAIVE = SNAIVE(Turnover)
  ) |>
  mutate(
    COMB = (ETS+ARIMA)/2
  ) |>
  forecast(h = "1 year") |>
  accuracy(data = cafe,
    measures = list(ss=skill_score(CRPS))
  )
# A tibble: 32 × 4
   .model State .type     ss
   <chr>  <chr> <chr>  <dbl>
 1 ARIMA  ACT   Test   0.465
 2 ARIMA  NSW   Test   0.331
 3 ARIMA  NT    Test  -0.359
 4 ARIMA  QLD   Test   0.402
 5 ARIMA  SA    Test   0.213
 6 ARIMA  TAS   Test   0.438
 7 ARIMA  VIC   Test  -0.813
 8 ARIMA  WA    Test   0.117
 9 COMB   ACT   Test   0.340
10 COMB   NSW   Test   0.355
# … with 22 more rows

Forecasting many series

cafe |>
  filter(year(Month) <= 2018) |>
  model(
    ETS = ETS(Turnover),
    ARIMA = ARIMA(Turnover ~ pdq(d=1) + PDQ(D=1)),
    SNAIVE = SNAIVE(Turnover)
  ) |>
  mutate(
    COMB = (ETS+ARIMA)/2
  ) |>
  forecast(h = "1 year") |>
  accuracy(data = cafe,
    measures = list(ss=skill_score(CRPS))
  ) |>
  group_by(.model) |>
  summarise(sspc = mean(ss) * 100)
# A tibble: 4 × 2
  .model  sspc
  <chr>  <dbl>
1 ARIMA   9.91
2 COMB   14.6 
3 ETS     6.23
4 SNAIVE  0   

The skill score is computed relative to seasonal naive forecasts.
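As a sketch of what the percentages in the table mean (Python, with invented CRPS values; one common definition of a skill score is the proportional reduction in CRPS relative to the benchmark):

```python
# Hypothetical average CRPS values for a combination model and for the
# seasonal naive benchmark (invented numbers).
crps_model = 14.6
crps_snaive = 20.0

# Skill score: 1 - CRPS(model) / CRPS(benchmark).
# Positive = better than the benchmark; 0 = no improvement.
ss = 1 - crps_model / crps_snaive
print(f"{ss * 100:.1f}% improvement over seasonal naive")
```

By construction the benchmark itself scores exactly 0, which is why SNAIVE shows 0 in the summary table above.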

For more information

Slides: robjhyndman.com/seminars/uscdc2022
